OpenVINO integration for CausalLM models #17
base: main
Conversation
Force-pushed from 0692b1c to 76a44fa
model_path: str,
model_class: Union[AutoModelForCausalLM, AutoModelForSeq2SeqLM],
dtype: torch.dtype,
quantize: Optional[str],  # not used by OpenVINO
Why not consider the quantize parameter as a trigger to compress the model weights to INT8 or INT4?
quantize is currently used for bitsandbytes and GPTQ, and passing anything else throws an error. We could presumably change that, but for weight compression it seemed that load_in_8bit (which is now the default) and, soon, load_in_4bit would be a better fit.
Personally, I would always compress offline and load the compressed model directly. TGIS requires downloaded weights, so if you want to compress the model on the fly, you would have to download the full-precision weights, keep them on disk, and then compress the model to 4 or 8 bit within TGIS every time, which takes several minutes.
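For illustration, here is a rough sketch of that offline workflow with optimum-intel; the model ID, output directory, and the exact load_in_8bit behaviour are assumptions, not part of this PR:

```python
# Sketch of offline weight compression with optimum-intel (names and paths
# are illustrative; check the optimum-intel docs for the exact API/version).
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # assumed example model
save_dir = "llama-2-7b-ov-int8"         # assumed output directory

# Convert the PyTorch checkpoint to OpenVINO IR and compress weights to INT8
# once, offline, so the serving process can load the compressed IR directly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
model.save_pretrained(save_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)

# At serving time the compressed model loads without re-exporting:
# OVModelForCausalLM.from_pretrained(save_dir)
```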
Thanks @helena-intel. I agree that offline compression is better. But I noticed you do allow on-the-fly conversion here, based on the logic below where you add the flag kwargs["export"] = True when the model is not model_is_ov. That is why I asked about on-the-fly compression for such models as well.
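For context, a minimal sketch of the kind of logic being discussed (the helper names and the openvino_model.xml check are assumptions; the actual PR code may differ):

```python
# Assumed sketch: detect whether the downloaded weights already contain an
# OpenVINO IR; if not, let optimum-intel convert the PyTorch weights on the
# fly via export=True.
import os
from optimum.intel import OVModelForCausalLM

def model_is_ov(model_path: str) -> bool:
    # An OpenVINO model directory contains an IR file such as openvino_model.xml.
    return os.path.exists(os.path.join(model_path, "openvino_model.xml"))

def load_ov_model(model_path: str):
    kwargs = {}
    if not model_is_ov(model_path):
        # Trigger on-the-fly conversion from the PyTorch checkpoint.
        kwargs["export"] = True
    return OVModelForCausalLM.from_pretrained(model_path, **kwargs)
```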
It's a good point! Note that at the moment we already do on-the-fly compression for models with more than 1B parameters, because they are converted to 8-bit by optimum-intel. But it would be good to make this configurable, especially now that we'll have load_in_4bit in optimum-intel soon. How do you propose to include this? Add a "weight_compression" option for quantize in addition to bitsandbytes and gptq? Or weight_compression_int4 and weight_compression_int8? Currently setting dtype_str to int8 also enables bitsandbytes quantization, so I thought we could reuse that, but it doesn't allow int4 out of the box because dtype_str is limited to torch dtypes. That can all be changed, but I would like to get a maintainer's opinion on the best way to do this first.
Another option could be to add an environment variable OPENVINO_WEIGHT_FORMAT and allow specifying an exact config for sym/asym, group size and ratio. That is the most flexible, but a different API than the other inference engines.
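To make the environment-variable idea concrete, a hypothetical sketch, assuming a later optimum-intel release that exposes OVWeightQuantizationConfig; the variable name, value syntax, and parsing are all illustrative:

```python
# Hypothetical mapping from an OPENVINO_WEIGHT_FORMAT environment variable to
# an optimum-intel weight-compression config. Not part of this PR.
# Example value: "int4_sym:group_size=128,ratio=0.8"
import os
from typing import Optional
from optimum.intel import OVWeightQuantizationConfig

def weight_config_from_env() -> Optional[OVWeightQuantizationConfig]:
    spec = os.environ.get("OPENVINO_WEIGHT_FORMAT")
    if not spec:
        return None  # fall back to optimum-intel defaults
    fmt, _, opts = spec.partition(":")
    extra = dict(kv.split("=") for kv in opts.split(",") if kv)
    return OVWeightQuantizationConfig(
        bits=4 if fmt.startswith("int4") else 8,
        sym=fmt.endswith("_sym"),
        group_size=int(extra.get("group_size", -1)),  # -1 = per-channel
        ratio=float(extra.get("ratio", 1.0)),
    )

# The resulting config could then be passed as quantization_config= when
# exporting the model with OVModelForCausalLM.from_pretrained(..., export=True).
```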
Signed-off-by: Helena <[email protected]>
Force-pushed from 76a44fa to 6349d91
Force-pushed from 6349d91 to 3fc754e
OpenVINO integration for text-generation-inference.
Known limitations:
It would be great to have a documented option to build the Docker image without GPU dependencies and flash-attention, maybe with a make cpubuild option, for example. make build-test-image works fine with this integration.